Similarity preserving compressions of high dimensional sparse data
Abstract
The rise of the internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high dimensional and sparse. The need to perform an efficient search for similar objects in such high dimensional big datasets is becoming increasingly common. Even with the rapid growth in computing power, a brute-force search for such a task is impractical and at times impossible. Therefore, algorithmic solutions such as Locality Sensitive Hashing (LSH) are required to achieve the desired search efficiency. Any similarity search method that achieves this efficiency uses one (or both) of the following strategies:
1. Compress the data by reducing its dimension while preserving the similarities between any pair of data objects.
2. Limit the search space by grouping the data objects based on their similarities.
Typically, 2 is obtained as a consequence of 1. Our focus is on high dimensional sparse data, where standard compression schemes, such as LSH for Hamming distance (Gionis, Indyk and Motwani [7]), become inefficient in both 1 and 2 due to at least one of the following reasons:
1. There is no efficient compression scheme mapping binary vectors to binary vectors.
2. The compression length is nearly linear in the dimension and grows inversely with the sparsity.
3. The randomness used grows linearly with the product of the dimension and the compression length.
We propose an efficient compression scheme that maps binary vectors to binary vectors while simultaneously preserving Hamming distance and Inner Product. Our scheme avoids all of the above drawbacks for high dimensional sparse data: the length of our compression depends only on the sparsity and is independent of the dimension of the data. Moreover, our scheme provides a one-shot solution for both Hamming distance and Inner Product, and works in the streaming setting as well.
In contrast with the “local projection” strategies used by most previous schemes, our scheme combines (exploiting sparsity) the following two strategies: 1. partitioning the dimensions into several buckets, and 2. obtaining “global linear summaries” within each of these buckets. We generalize our scheme to real-valued data and obtain compressions for Euclidean distance, Inner Product, and k-way Inner Product.
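The bucket-and-summarize strategy described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact construction: it assumes a uniformly random partition of the dimensions into buckets, and assumes parity (XOR over GF(2)) as the “global linear summary” of each bucket — the paper's actual summary function and partitioning may differ.

```python
import random

def compress(x, n_buckets, seed=0):
    """Compress a binary vector x (length d) to n_buckets bits.

    Each of the d dimensions is assigned to one of n_buckets buckets
    by a seeded random partition; the compressed bit of a bucket is
    the parity (XOR) of the original bits that fall into it.
    The output length depends only on n_buckets, not on d.
    """
    d = len(x)
    rng = random.Random(seed)
    # Random partition of the dimensions into buckets.
    assignment = [rng.randrange(n_buckets) for _ in range(d)]
    out = [0] * n_buckets
    for i, bit in enumerate(x):
        if bit:
            out[assignment[i]] ^= 1  # parity summary of the bucket
    return out
```

Note that flipping a single input coordinate flips exactly one output bit (the parity of its bucket), so the compressed Hamming distance never exceeds the original one; for sparse vectors, where few set bits share a bucket, it matches the original distance with high probability, which is why the compression length can depend only on the sparsity.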
Similar resources
Hyperspectral Image Classification Based on the Fusion of the Features Generated by Sparse Representation Methods, Linear and Non-linear Transformations
The ability to record the high-resolution spectral signature of the earth's surface is the most important feature of hyperspectral sensors. Classification of hyperspectral imagery is known as one of the methods for extracting information from these remote sensing data sources. Despite the high potential of hyperspectral images from the information-content point of view, there...
A Geometry Preserving Kernel over Riemannian Manifolds
Abstract- Kernel trick and projection to tangent spaces are two choices for linearizing the data points lying on Riemannian manifolds. These approaches are used to provide the prerequisites for applying standard machine learning methods on Riemannian manifolds. Classical kernels implicitly project data to high dimensional feature space without considering the intrinsic geometry of data points. ...
ISIS: A New Approach for Efficient Similarity Search in Sparse Databases
High-dimensional sparse data is prevalent in many real-life applications. In this paper, we propose a novel index structure for accelerating similarity search in high-dimensional sparse databases, named ISIS, which stands for Indexing Sparse databases using Inverted fileS. ISIS clusters a dataset and converts the original high-dimensional space into a new space where each dimension represents a...
Mammalian Eye Gene Expression Using Support Vector Regression to Evaluate a Strategy for Detecting Human Eye Disease
Background and purpose: Machine learning is a class of modern and strong tools that can solve many important problems that nowadays humans may be faced with. Support vector regression (SVR) is a way to build a regression model which is an incredible member of the machine learning family. SVR has been proven to be an effective tool in real-value function estimation. As a supervised-learning appr...
A Class of Region-preserving Space Transformations for Indexing High-dimensional Data
This study introduces a class of region preserving space transformation (RPST) schemes for accessing high-dimensional data. The access methods in this class differ with respect to their spacepartitioning strategies. The study develops two new static partitioning schemes that can split each dimension of the space within linear space complexity. They also support an effective mechanism for handli...
Journal: CoRR
Volume: abs/1612.06057
Pages: -
Publication date: 2016